Assignment instructions:

Assignment submission (YOUR NAME):

Liz Peterson


library(tidyverse)
library(here)
library(janitor)
library(estimatr)  
library(performance)
library(jtools)
library(gt)
library(gtsummary)
library(MASS) ## NOTE: The `select()` function is masked. Use: `dplyr::select()` ##
library(interactions) 
library(ggridges)

DATA SOURCE:

Reed D. 2019. SBC LTER: Reef: Abundance, size and fishing effort for California Spiny Lobster (Panulirus interruptus), ongoing since 2012. Environmental Data Initiative. https://doi.org/10.6073/pasta/a593a675d644fdefb736750b291579a0. Dataset accessed 11/17/2019.


Introduction

You’re about to dive into some deep data collected from five reef sites in Santa Barbara County, all about the abundance of California spiny lobsters! 🦞 Data was gathered by divers annually from 2012 to 2018 across Naples, Mohawk, Isla Vista, Carpinteria, and Arroyo Quemado reefs.

Why lobsters? Well, this sample provides an opportunity to evaluate the impact of Marine Protected Areas (MPAs) established on January 1, 2012 (Reed, 2019). Of these five reefs, Naples, and Isla Vista are MPAs, while the other three are not protected (non-MPAs). Comparing lobster health between these protected and non-protected areas gives us the chance to study how commercial and recreational fishing might impact these ecosystems.

We will consider the MPA sites the treatment group and use regression methods to explore whether protecting these reefs really makes a difference compared to non-MPA sites (our control group). In this assignment, we’ll think deeply about which causal inference assumptions hold up under the research design and identify where they fall short.

Let’s break it down step by step and see what the data reveals! 📊


Step 1: Anticipating potential sources of selection bias

a. Do the control sites (Arroyo Quemado, Carpenteria, and Mohawk) provide a strong counterfactual for our treatment sites (Naples, Isla Vista)? Write a paragraph making a case for why this comparison is centris paribus or whether selection bias is likely (be specific!).

While if the other factors did stay the same it would be a relatively good comparison, there are some issues that keep it from being a perfect comparison. A perfect counterfactual would mean that all of the other factors besides being an MPA or not would be the same. However, here the locations cannot be exactly the same. Therefore, the distance north that the site is might affect the water / other factors that could impact the lobster health.


Step 2: Read & wrangle data

a. Read in the raw data. Name the data.frame (df) rawdata

b. Use the function clean_names() from the janitor package

# HINT: check for coding of missing values (`na = "-99999"`)

rawdata <- read_csv('data/spiny_abundance_sb_18.csv', na = '-99999') %>%
    clean_names()
unique(rawdata$site)
## [1] "IVEE" "NAPL" "AQUE" "CARP" "MOHK"

c. Create a new df named tidyata. Using the variable site (reef location) create a new variable reef as a factor and add the following labels in the order listed (i.e., re-order the levels):

"Arroyo Quemado", "Carpenteria", "Mohawk", "Isla Vista",  "Naples"
levels = c("AQUE", "CARP", "MOHK", "IVEE", "NAPL")
labels = c("Arroyo Quemado", "Carpenteria", "Mohawk", "Isla Vista",  "Naples")
tidydata <- rawdata %>%
    mutate(reef = factor(site,
                         levels = levels,
                         labels = labels)) %>%
    arrange(reef)

Create new df named spiny_counts

d. Create a new variable counts to allow for an analysis of lobster counts where the unit-level of observation is the total number of observed lobsters per site, year and transect.

spiny_counts <- tidydata %>%
    group_by(site, year, transect) %>%
    summarise(counts = sum(count, na.rm = TRUE),
              mean_size = mean(size_mm, na.rm = TRUE)) %>%
    ungroup()

e. Create a new variable mpa with levels MPA and non_MPA. For our regression analysis create a numerical variable treat where MPA sites are coded 1 and non_MPA sites are coded 0

#HINT(d): Use `group_by()` & `summarize()` to provide the total number of lobsters observed at each site-year-transect row-observation. 

#HINT(e): Use `case_when()` to create the 3 new variable columns
spiny_counts <- spiny_counts %>%
    mutate(mpa = case_when(site %in% c("IVEE", "NAPL") ~ "MPA",
           .default = "non_MPA"),
           treat = case_when(mpa == "MPA" ~ 1,
                             .default = 0))

NOTE: This step is crucial to the analysis. Check with a friend or come to TA/instructor office hours to make sure the counts are coded correctly!


Step 3: Explore & visualize data

a. Take a look at the data! Get familiar with the data in each df format (tidydata, spiny_counts)

b. We will focus on the variables count, year, site, and treat(mpa) to model lobster abundance. Create the following 4 plots using a different method each time from the 6 options provided. Add a layer (geom) to each of the plots including informative descriptive statistics (you choose; e.g., mean, median, SD, quartiles, range). Make sure each plot dimension is clearly labeled (e.g., axes, groups).

Create plots displaying the distribution of lobster counts:

  1. grouped by reef site
  2. grouped by MPA status
  3. grouped by year

Create a plot of lobster size :

  1. You choose the grouping variable(s)!
# plot 1: distribution of lobster counts grouped by reef site

spiny_counts %>% 
    ggplot(aes(x = site, y = counts, color = site)) +
    geom_boxplot() +
    labs(x = "Reef site",
         y = "Lobster count",
         title = "Distribution of lobster counts grouped by reef site")

# plot 2: distribution of lobster counts grouped by MPA status
spiny_counts %>% 
    ggplot(aes(x = mpa, y = counts, color = mpa)) +
    geom_violin() +
    labs(x = "MPA Status",
         y = "Lobster count",
         title = "Distribution of lobster counts grouped by MPA status")

# plot 3: distribution of lobster counts grouped by year
spiny_counts %>% 
    ggplot(aes(x = counts, y = year, fill = as.factor(year))) +
    geom_density_ridges() +
    labs(x = "Lobster counts",
         y = "Year",
         title = "Distribution of lobster counts grouped by year")

# plot 4: distribution of lobster size grouped by site
tidydata %>% 
    ggplot(aes(x = size_mm, fill = site)) +
    geom_histogram(position = "identity") +
    labs(x = "Lobster size (mm)",
         y = "Lobster count",
         title = "Distribution of lobster size grouped by site")

c. Compare means of the outcome by treatment group. Using the tbl_summary() function from the package gt_summary

# USE: gt_summary::tbl_summary()
spiny_counts %>%
    dplyr::select(counts, mean_size, treat) %>%
    tbl_summary(by = treat,
                statistic = list(all_continuous() ~ "{mean}"))
Characteristic 0
N = 133
1
1
N = 119
1
counts 23 28
mean_size 73 76
    Unknown 15 12
1 Mean

Step 4: OLS regression- building intuition

a. Start with a simple OLS estimator of lobster counts regressed on treatment. Use the function summ() from the jtools package to print the OLS output

b. Interpret the intercept & predictor coefficients in your own words. Use full sentences and write your interpretation of the regression results to be as clear as possible to a non-academic audience.

# NOTE: We will not evaluate/interpret model fit in this assignment (e.g., R-square)

m1_ols <- lm(
    counts ~ treat,
    data = spiny_counts
)

summ(m1_ols, model.fit = FALSE) 
Observations 252
Dependent variable counts
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 22.73 3.57 6.36 0.00
treat 5.36 5.20 1.03 0.30
Standard errors: OLS

The intercept value gives us the value on the y-axis when the treat condition is 0, or when the site is not an MPA. We see here that our value is 22.73, meaning there are almost 23 lobsters in non-MPA sites. The value for treat is the slope of the regression line. This means that per one unit treat, which in this case is just 1, or the condition that there is an MPA, we increase 5.36 lobsters.

c. Check the model assumptions using the check_model function from the performance package

d. Explain the results of the 4 diagnostic plots. Why are we getting this result?

check_model(m1_ols,  check = "qq" )

The ““qq” plot checks for the distribution of the residuals. We can assume a normal distribution of the residuals if the dots fall along the green line at 0. However, this is not the case.

check_model(m1_ols, check = "normality")

The “normality” plot should show us a normal distribution curve if the residuals are distributed normally. This, however, is not the case.

check_model(m1_ols, check = "homogeneity")

The “homogeneity” plot should give us horizontal reference line if we have homogeneous variance values.

check_model(m1_ols, check = "pp_check")

The “pp_check” plot shows us the observed data with the model-predicted data. The observed data varies quite heavily from the model-predicted data.

These plots lead us to believe that the relationship between counts and treat is not best described by a linear regression model.


Step 5: Fitting GLMs

a. Estimate a Poisson regression model using the glm() function

#HINT1: Incidence Ratio Rate (IRR): Exponentiation of beta returns coefficient which is interpreted as the 'percent change' for a one unit increase in the predictor 

#HINT2: For the second glm() argument `family` use the following specification option `family = poisson(link = "log")`

m2_pois <- glm(
    counts ~ treat,
    family = poisson(link="log"),
    data = spiny_counts
)

summ(m2_pois, model.fit=FALSE)
Observations 252
Dependent variable counts
Type Generalized linear model
Family poisson
Link log
Est. S.E. z val. p
(Intercept) 3.12 0.02 171.74 0.00
treat 0.21 0.03 8.44 0.00
Standard errors: MLE

b. Interpret the predictor coefficient in your own words. Use full sentences and write your interpretation of the results to be as clear as possible to a non-academic audience.

The predictor coefficient here equals 0.21. For a poisson regression model, we take the log of this value. This gives us the percent change in counts when the treat condition is changed (from being 0, nonMPA, to 1, MPA).

c. Explain the statistical concept of dispersion and overdispersion in the context of this model.

Overdispersion would mean that our variance is greater than our mean. However, it is an assumption that the poisson model makes that the mean is equal to the variance.

d. Compare results with previous model, explain change in the significance of the treatment effect

The poisson model gave us more statistically significant results than the linear model, seeing as the p value for the poisson model was 0.00 and 0.03 for the linear model.

e. Check the model assumptions. Explain results.

It seem that the poisson model did a better job of predicting our values, which makes sense considering the data. A linear regression often struggles with binary values.

f. Conduct tests for over-dispersion & zero-inflation. Explain results.

check_model(m2_pois)

check_overdispersion(m2_pois)
## # Overdispersion test
## 
##        dispersion ratio =    67.033
##   Pearson's Chi-Squared = 16758.289
##                 p-value =   < 0.001

This test tells us that overdispersion is detected in the model. Based on our definition from before, this indicates that the variance is greater than the mean.

check_zeroinflation(m2_pois)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 0
##             Ratio: 0.00

This test tells us that the model is underfitting zeroes (probable zero-inflation).

g. Fit a negative binomial model using the function glm.nb() from the package MASS and check model diagnostics

# NOTE: The `glm.nb()` function does not require a `family` argument

m3_nb <- glm.nb(
    counts ~ treat,
    data = spiny_counts
)

summ(m3_nb)
Observations 252
Dependent variable counts
Type Generalized linear model
Family Negative Binomial(0.55)
Link log
𝛘²(250) 1.52
p 0.22
Pseudo-R² (Cragg-Uhler) 0.01
Pseudo-R² (McFadden) 0.00
AIC 2088.53
BIC 2099.12
Est. S.E. z val. p
(Intercept) 3.12 0.12 26.40 0.00
treat 0.21 0.17 1.23 0.22
Standard errors: MLE

h. In 1-2 sentences explain rationale for fitting this GLM model.

The GLM model controls overdispersion, which we identified above.

i. Interpret the treatment estimate result in your own words. Compare with results from the previous model.

check_overdispersion(m3_nb)
## # Overdispersion test
## 
##  dispersion ratio = 1.398
##           p-value = 0.088

The overdispersion from before was dealt with.

check_zeroinflation(m3_nb)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 30
##             Ratio: 1.12

This indicates that the model is overfitting zeros, as it was above.

check_predictions(m3_nb)

Here, we see that the observed data and the model-predicted data are much closer than what we previously had seen.

check_model(m3_nb)


Step 6: Compare models

a. Use the export_summ() function from the jtools package to look at the three regression models you fit side-by-side.

c. Write a short paragraph comparing the results. Is the treatment effect robust or stable across the model specifications.

export_summs(m1_ols, m2_pois, m3_nb,
             model.names = c("OLS","Poisson", "NB"),
             statistics = "none")
OLSPoissonNB
(Intercept)22.73 ***3.12 ***3.12 ***
(3.57)   (0.02)   (0.12)   
treat5.36    0.21 ***0.21    
(5.20)   (0.03)   (0.17)   
*** p < 0.001; ** p < 0.01; * p < 0.05.

When we compare all three models together, we see that they are quite similar. This indicates to us that the treatment effect is relatively robust. The p value for each model falls in the same range as the others. It makes sense, additionally, that the poisson and negative binomial values are more similar than the OLS values because they are scaled by the log.


Step 7: Building intuition - fixed effects

a. Create new df with the year variable converted to a factor

ff_counts <- spiny_counts %>% 
    mutate(year=as_factor(year))

b. Run the following OLS model using lm()

m5_fixedeffs <- lm(
    log(counts+1) ~ treat*year,
    data = ff_counts
)

summ(m5_fixedeffs, model.fit=FALSE)
Observations 252
Dependent variable log(counts + 1)
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 1.95 0.27 7.26 0.00
treat -1.23 0.39 -3.16 0.00
year2013 -0.27 0.38 -0.71 0.48
year2014 0.02 0.38 0.04 0.97
year2015 0.49 0.38 1.30 0.20
year2016 0.61 0.38 1.61 0.11
year2017 1.04 0.38 2.73 0.01
year2018 0.83 0.38 2.18 0.03
treat:year2013 1.16 0.55 2.10 0.04
treat:year2014 1.85 0.55 3.35 0.00
treat:year2015 2.25 0.55 4.08 0.00
treat:year2016 0.95 0.55 1.71 0.09
treat:year2017 1.22 0.55 2.20 0.03
treat:year2018 2.27 0.55 4.12 0.00
Standard errors: OLS

c. Take a look at the regression output. Each coefficient provides a comparison or the difference in means for a specific sub-group in the data. Informally, describe the what the model has estimated at a conceptual level (NOTE: you do not have to interpret coefficients individually)

On a conceptual level, this model has estimated the treatment effect, like we did before, but now this model also accounts for the year as it relates to the treatment effect.

d. Explain why the main effect for treatment is negative? *Does this result make sense?

Because the main effect for treatment is negative, this means that the value of log(counts + 1) decreases as we move from nonMPA to MPA. I’m not sure of a scenario when this would occur.

e. Look at the model predictions: Use the interact_plot() function from package interactions to plot mean predictions by year and treatment status.

f. Re-evaluate your responses (c) and (b) above.

# Hint 1: Group counts by `year` and `mpa` and calculate the `mean_count`
# Hint 2: Convert variable `year` to a factor

interact_plot(m5_fixedeffs, pred = year, modx = treat,
              outcome.scale = "response")

This plot show us that the log(counts+1) increases as we move from nonMPA to MPA. Additionally, it shows us that in 2012, the value was much lower. This could account for that decreasing value we saw above.

g. Using ggplot() create a plot in same style as the previous interaction plot, but displaying the original scale of the outcome variable (lobster counts). This type of plot is commonly used to show how the treatment effect changes across discrete time points (i.e., panel data).

The plot should have… - year on the x-axis - counts on the y-axis - mpa as the grouping variable

# Hint 1: Group counts by `year` and `mpa` and calculate the `mean_count`
# Hint 2: Convert variable `year` to a factor

plot_counts <- ff_counts %>%
    group_by(year, mpa) %>%
    ggplot(aes(x = year, y = counts, color = mpa)) +
    geom_point() +
    geom_line() +
    labs(x = "Year",
         y = "Counts")

# plot_counts %>% ggplot() ...
plot_counts


Step 8: Reconsider causal identification assumptions

  1. Discuss whether you think spillover effects are likely in this research context (see Glossary of terms; https://docs.google.com/document/d/1RIudsVcYhWGpqC-Uftk9UTz3PIq6stVyEpT44EPNgpE/edit?usp=sharing)

    In this specific research scenario, spillover effects seem likely. This mostly here means slight migration on the lobsters part, because of how close together the sites are.

  2. Explain why spillover is an issue for the identification of causal effects

    Spillover keeps researchers from being able to identify causal effect between conditions. The glossary above puts it like this, “it reduces the difference between the control and treatment group means.”

  3. How does spillover relate to impact in this research setting?

    If the lobsters in MPAs and nonMPAs are not 100% a part of that site, the results could be very influenced.

  4. Discuss the following causal inference assumptions in the context of the MPA treatment effect estimator. Evaluate if each of the assumption are reasonable:

    1. SUTVA: Stable Unit Treatment Value assumption

      The first condition for SUTVA is that one unit’s treatment does not affect another unit’s outcome. As we just discussed with the probable spillover effect, we cannot assure this.

    2. Excludability assumption

      The excludability assumption requires that the only thing influencing the lobster count could be the MPA status. It is not possible to assure this fact here.


EXTRA CREDIT

Use the recent lobster abundance data with observations collected up until 2024 (lobster_sbchannel_24.csv) to run an analysis evaluating the effect of MPA status on lobster counts using the same focal variables.

  1. Create a new script for the analysis on the updated data
  2. Run at least 3 regression models & assess model diagnostics
  3. Compare and contrast results with the analysis from the 2012-2018 data sample (~ 2 paragraphs)